Presented at

Session-based Recommender Systems: Hands-on GRU4Rec

Frederick Ayala Gómez, PhD Student in Computer Science at ELTE University. Visiting Researcher at Aalto's Data Mining Group

Let's keep in touch!

Twitter: https://twitter.com/fredayala
LinkedIn: https://linkedin.com/in/frederickayala
GitHub: https://github.com/frederickayala


  • A few notes:

    • This notebook was tested on Windows and shows how to use GRU4Rec
    • The GRU4Rec paper is: B. Hidasi et al., 2015. “Session-based recommendations with recurrent neural networks”. CoRR
    • The poster for the paper can be found at http://www.hidasi.eu/content/gru4rec_iclr16_poster.pdf
    • On macOS and Linux, CUDA, Theano and Anaconda 'might' need some extra steps
    • On a Linux desktop (e.g. Ubuntu Desktop), be careful when installing CUDA and the NVIDIA drivers. It 'might' break lightdm 🙈🙉🙊
    • An NVIDIA GeForce GTX 980M was used
    • The starting point of this notebook is the original Python demo file from Balázs Hidasi's GRU4Rec repository.
    • It's recommended to use Anaconda to make installation easier
  • Installation steps:

    • Install CUDA 8.0 from https://developer.nvidia.com/cuda-downloads
    • Install Anaconda 4.3.1 for Python 3.6 from https://www.continuum.io/downloads
    • Open Anaconda Navigator
      • Go to Environments / Create / Python Version 3.6 and give the environment a name
      • In Channels, add: conda-forge then click on Update index...
      • Click on your environment's Play arrow and choose Open Terminal
      • Install the libraries that we need:
        • conda install numpy scipy pandas mkl-service libpython m2w64-toolchain nose nose-parameterized sphinx pydot-ng
        • conda install theano pygpu
        • conda install matplotlib seaborn statsmodels
    • Create a .theanorc file in your home directory and add the following:
      [global]
      device = cuda
      # Only if you want to use cuDNN
      [dnn]
      include_path=/path/to/cuDNN/include
      library_path=/path/to/cuDNN/lib/x64
  • Get the GRU4Rec code and the dataset
    • GRU4Rec:
    • YOOCHOOSE Dataset:
    • To get the training and testing files we have to preprocess the original dataset.
      • Go to the terminal that is running your Anaconda environment
      • Navigate to the GRU4Rec folder
      • Edit the file GRU4Rec/examples/rsc15/preprocess.py and modify the following variables:
        • PATH_TO_ORIGINAL_DATA: the path to the raw input dataset
        • PATH_TO_PROCESSED_DATA: the path where you want the output
      • Run the command: python preprocess.py
      • This will take some time. When the process ends, you will have the files rsc15_train_full.txt and rsc15_test.txt in your PATH_TO_PROCESSED_DATA path
    • Place this notebook in the folder GRU4Rec/examples/rsc15/
  • That's it! We are ready to run GRU4Rec
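Before running preprocess.py it helps to know what session-based preprocessing usually involves: dropping very rare items and then dropping sessions that become too short. This is a minimal sketch of that pattern on a toy click log — the function name and thresholds here are illustrative, not taken from the actual script (the exact thresholds and the time-based train/test split are defined in preprocess.py itself):

```python
import pandas as pd

def filter_clicks(df, min_item_support=5, min_session_length=2):
    # Drop rare items first, then drop sessions that became too short
    item_support = df.groupby("ItemId").size()
    df = df[df["ItemId"].isin(item_support[item_support >= min_item_support].index)]
    session_length = df.groupby("SessionId").size()
    return df[df["SessionId"].isin(session_length[session_length >= min_session_length].index)]

# Toy click log: item 99 is rare, and dropping it leaves session 3 empty
clicks = pd.DataFrame({
    "SessionId": [1, 1, 2, 2, 3],
    "ItemId":    [10, 20, 10, 20, 99],
    "Time":      [1.0, 2.0, 3.0, 4.0, 5.0],
})
filtered = filter_clicks(clicks, min_item_support=2, min_session_length=2)
print(sorted(filtered["ItemId"].unique()))  # [10, 20]
```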

In [1]:
# -*- coding: utf-8 -*-
import theano
import pickle
import sys
import os
sys.path.append('../..')
import numpy as np
import pandas as pd
import gru4rec #If this shows an error probably the notebook is not in GRU4Rec/examples/rsc15/
import evaluation


Using cuDNN version 5110 on context None
Mapped name None to device cuda: GeForce GTX 980M (0000:01:00.0)

In [2]:
# Validate that the following assert makes sense in your platform
# This works on Windows with a NVIDIA GPU
# In other platforms theano.config.device gives other things than 'cuda' when using the GPU
assert 'cuda' in theano.config.device,("Theano is not configured to use the GPU. Please check .theanorc. "
                                       "Check http://deeplearning.net/software/theano/tutorial/using_gpu.html")

Update PATH_TO_TRAIN and PATH_TO_TEST to the paths of rsc15_train_full.txt and rsc15_test.txt, respectively


In [3]:
PATH_TO_TRAIN = 'C:/Users/frede/datasets/recsys2015/rsc15_train_full.txt'
PATH_TO_TEST = 'C:/Users/frede/datasets/recsys2015/rsc15_test.txt'

data = pd.read_csv(PATH_TO_TRAIN, sep='\t', dtype={'ItemId':np.int64})
valid = pd.read_csv(PATH_TO_TEST, sep='\t', dtype={'ItemId':np.int64})

Let's take a look at the datasets


In [4]:
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)


C:\Users\frede\Anaconda3\envs\tf\lib\site-packages\IPython\html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)
Sneak peek at the dataset

In [5]:
data.head()


Out[5]:
SessionId ItemId Time
0 1 214536502 1.396857e+09
1 1 214536500 1.396857e+09
2 1 214536506 1.396857e+09
3 1 214577561 1.396857e+09
4 2 214662742 1.396868e+09

In [6]:
valid.head()


Out[6]:
SessionId ItemId Time
0 11265009 214586805 1.411993e+09
1 11265009 214509260 1.411993e+09
2 11265017 214857547 1.412007e+09
3 11265017 214857268 1.412007e+09
4 11265017 214857260 1.412007e+09

In [7]:
sessions_training = set(data.SessionId)
print("There are %i sessions in the training dataset" % len(sessions_training))
sessions_testing = set(valid.SessionId)
print("There are %i sessions in the testing dataset" % len(sessions_testing))
assert len(sessions_testing.intersection(sessions_training)) == 0, ("Huhu! "
                                                                    "there are sessions from the testing set in "
                                                                    "the training set")
print("Sessions in the testing set don't exist in the training set")


There are 7966257 sessions in the training dataset
There are 15324 sessions in the testing dataset
Sessions in the testing set don't exist in the training set

In [8]:
items_training = set(data.ItemId)
print("There are %i items in the training dataset" % len(items_training))
items_testing = set(valid.ItemId)
print("There are %i items in the testing dataset" % len(items_testing))
assert items_testing.issubset(items_training), ("Huhu! "
                                                "there are items from the testing set "
                                                "that are not in the training set")
print("Items in the testing set exist in the training set")


There are 37483 items in the training dataset
There are 6751 items in the testing dataset
Items in the testing set exist in the training set

In [9]:
df_visualization = data.copy()
df_visualization["value"] = 1
df_item_count = df_visualization[["ItemId","value"]].groupby("ItemId").sum()

In [10]:
# Most of the items are infrequent
df_item_count.describe().transpose()


Out[10]:
count mean std min 25% 50% 75% max
value 37483.0 844.042339 3155.62772 1.0 16.0 77.0 368.0 132658.0
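The groupby-sum above is just an item frequency count; pandas' `value_counts` produces the same numbers, as this quick toy check (not part of the original notebook) shows:

```python
import pandas as pd

df = pd.DataFrame({"ItemId": [1, 1, 2, 3, 3, 3]})
df["value"] = 1

by_groupby = df.groupby("ItemId")["value"].sum()          # the notebook's approach
by_counts = df["ItemId"].value_counts().sort_index()      # the one-liner equivalent

assert (by_groupby.values == by_counts.values).all()
print(by_counts.to_dict())  # {1: 2, 2: 1, 3: 3}
```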

In [11]:
fig = plt.figure(figsize=[15,8])
ax = fig.add_subplot(111)
ax = sns.kdeplot(df_item_count["value"], ax=ax)
ax.set(xlabel='Item Frequency', ylabel='Kernel Density Estimation')
plt.show()
fig = plt.figure(figsize=[15,8])
ax = fig.add_subplot(111)
ax = sns.distplot(df_item_count["value"],
             hist_kws=dict(cumulative=True),
             kde_kws=dict(cumulative=True))
ax.set(xlabel='Item Frequency', ylabel='Cumulative Probability')
plt.show()



In [12]:
# Let's analyze the co-occurrence
df_cooccurrence = data.copy()
df_cooccurrence["next_SessionId"] = df_cooccurrence["SessionId"].shift(-1)
df_cooccurrence["next_ItemId"] = df_cooccurrence["ItemId"].shift(-1)
df_cooccurrence["next_Time"] = df_cooccurrence["Time"].shift(-1)
df_cooccurrence = df_cooccurrence.query("SessionId == next_SessionId").dropna()
df_cooccurrence["next_ItemId"] = df_cooccurrence["next_ItemId"].astype(int)
df_cooccurrence["next_SessionId"] = df_cooccurrence["next_SessionId"].astype(int)
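The shift(-1) trick above pairs every click with the click that follows it in the log; filtering on SessionId == next_SessionId then keeps only pairs that stay inside the same session. A toy log (made up for illustration) makes this concrete:

```python
import pandas as pd

log = pd.DataFrame({
    "SessionId": [1, 1, 1, 2, 2],
    "ItemId":    [10, 20, 30, 40, 50],
})
log["next_SessionId"] = log["SessionId"].shift(-1)
log["next_ItemId"] = log["ItemId"].shift(-1)

# The (30 -> 40) pair is dropped because it crosses a session boundary
pairs = log.query("SessionId == next_SessionId").dropna()
print(list(zip(pairs["ItemId"], pairs["next_ItemId"].astype(int))))
# [(10, 20), (20, 30), (40, 50)]
```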

In [13]:
df_cooccurrence.head()


Out[13]:
SessionId ItemId Time next_SessionId next_ItemId next_Time
0 1 214536502 1.396857e+09 1 214536500 1.396857e+09
1 1 214536500 1.396857e+09 1 214536506 1.396857e+09
2 1 214536506 1.396857e+09 1 214577561 1.396857e+09
4 2 214662742 1.396868e+09 2 214662742 1.396868e+09
5 2 214662742 1.396868e+09 2 214825110 1.396868e+09

In [14]:
df_cooccurrence["time_difference_minutes"] = np.round((df_cooccurrence["next_Time"] - df_cooccurrence["Time"]) / 60, 2)
df_cooccurrence[["time_difference_minutes"]].describe().transpose()


Out[14]:
count mean std min 25% 50% 75% max
time_difference_minutes 23670982.0 2.476333 5.434214 0.0 0.45 0.98 2.16 108.4

In [15]:
df_cooccurrence["value"] = 1
df_cooccurrence_sum = df_cooccurrence[["ItemId","next_ItemId","value"]].groupby(["ItemId","next_ItemId"]).sum().reset_index()

In [16]:
df_cooccurrence_sum[["value"]].describe().transpose()


Out[16]:
count mean std min 25% 50% 75% max
value 3706018.0 6.387174 78.06891 1.0 1.0 1.0 2.0 71095.0

Training GRU


In [17]:
n_layers = 100  # number of hidden units in the single GRU layer
save_to = os.path.join(os.path.dirname(PATH_TO_TEST), "gru_" + str(n_layers) +".pickle")

In [18]:
if not os.path.exists(save_to):
    print('Training GRU4Rec with ' + str(n_layers) + ' hidden units')    
    gru = gru4rec.GRU4Rec(layers=[n_layers], loss='top1', batch_size=50, 
                          dropout_p_hidden=0.5, learning_rate=0.01, momentum=0.0)
    gru.fit(data)
    pickle.dump(gru, open(save_to, "wb"))
else:
    print('Loading existing GRU4Rec model with ' + str(n_layers) + ' hidden units')    
    gru = pickle.load(open(save_to, "rb"))


Loading existing GRU4Rec model with 100 hidden units

Evaluating GRU


In [19]:
res = evaluation.evaluate_sessions_batch(gru, valid, None, cut_off=20)

In [20]:
print('The proportion of cases having the desired item within the top 20 (i.e. Recall@20): {}'.format(res[0]))


The proportion of cases having the desired item within the top 20 (i.e. Recall@20): 0.5858170238648968

In [21]:
batch_size = 500
print("Now let's try to predict over the first %i events of our testing dataset" % batch_size)


Now let's try to predict over the first 500 events of our testing dataset

In [22]:
df_valid = valid.head(batch_size).copy()  # .copy() avoids a SettingWithCopyWarning when adding columns
df_valid["next_ItemId"] = df_valid["ItemId"].shift(-1)
df_valid["next_SessionId"] = df_valid["SessionId"].shift(-1)

In [23]:
session_ids = valid.head(batch_size)["SessionId"].values
input_item_ids = valid.head(batch_size)["ItemId"].values
predict_for_item_ids=None

In [24]:
%timeit gru.predict_next_batch(session_ids=session_ids, input_item_ids=input_item_ids, predict_for_item_ids=None, batch=batch_size)


1 loop, best of 3: 119 ms per loop

In [25]:
df_preds = gru.predict_next_batch(session_ids=session_ids, 
                      input_item_ids=input_item_ids,
                      predict_for_item_ids=None,
                      batch=batch_size)

In [26]:
df_valid.shape


Out[26]:
(500, 5)

In [27]:
df_preds.shape


Out[27]:
(37483, 500)
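Each column of df_preds holds the scores of all 37,483 items for one event in the batch, so the top-k recommendations for an event are simply the k best-scoring row labels in its column. A toy sketch with made-up item ids and scores:

```python
import pandas as pd

# Toy score matrix shaped like predict_next_batch's output:
# one row per item, one column per event in the batch
scores = pd.DataFrame(
    [[0.1, 0.9],
     [0.8, 0.2],
     [0.3, 0.7],
     [0.5, 0.4]],
    index=[101, 102, 103, 104],  # item ids
    columns=["e1", "e2"],        # events in the batch
)

k = 2
# For each event, take the k items with the highest scores
top_k = {c: scores[c].nlargest(k).index.tolist() for c in scores.columns}
print(top_k)  # {'e1': [102, 104], 'e2': [101, 103]}
```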

In [28]:
df_preds.columns = df_valid.index.values

In [29]:
len(items_training)


Out[29]:
37483

In [30]:
df_preds


Out[30]:
3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 ... 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398
214536502 -0.125554 -0.408401 0.241020 0.241020 -0.389289 -0.078933 -0.477493 0.232484 0.206817 0.244041 ... -0.428196 -0.428196 0.041961 0.098071 0.093519 0.206817 0.172675 0.241020 0.244041 0.157720
214536500 -0.288342 -0.476986 -0.033407 -0.033407 -0.438548 -0.090349 -0.567015 -0.091533 -0.091895 -0.021272 ... -0.454540 -0.454540 -0.027286 0.039859 -0.021722 -0.091895 -0.043099 -0.033407 -0.021272 -0.176717
214536506 -0.075021 -0.248366 0.054361 0.054361 -0.132804 0.056726 -0.488294 0.042792 0.003332 0.066164 ... -0.121433 -0.121433 0.014222 0.077800 0.050395 0.003332 0.040057 0.054361 0.066164 0.006859
214577561 -0.161591 -0.146989 -0.196368 -0.196368 -0.388048 -0.133704 -0.588967 -0.203824 -0.229023 -0.147396 ... -0.396826 -0.396826 -0.024899 0.042182 0.031068 -0.229023 -0.196541 -0.196368 -0.147396 -0.293953
214662742 -0.260508 -0.274328 -0.163363 -0.163363 -0.321595 -0.315299 0.224851 -0.140010 -0.107890 -0.167530 ... -0.324400 -0.324400 -0.247865 -0.334929 -0.402040 -0.107890 -0.223965 -0.163363 -0.167530 -0.226684
214825110 -0.352469 -0.434315 -0.339463 -0.339463 -0.224969 -0.303880 0.225020 -0.335964 -0.273014 -0.322670 ... -0.217162 -0.217162 -0.044710 -0.104694 -0.070252 -0.273014 -0.318397 -0.339463 -0.322670 -0.366161
214757390 -0.315072 -0.454990 -0.384481 -0.384481 -0.208142 -0.265650 0.229417 -0.396358 -0.307024 -0.370609 ... -0.199074 -0.199074 -0.028521 -0.094057 -0.071392 -0.307024 -0.350611 -0.384481 -0.370609 -0.435374
214757407 -0.278239 -0.443837 -0.200937 -0.200937 -0.136003 -0.180628 0.173040 -0.188200 -0.115801 -0.192596 ... -0.133456 -0.133456 0.050056 -0.020577 0.019239 -0.115801 -0.194293 -0.200937 -0.192596 -0.272369
214551617 -0.307604 -0.471146 -0.440282 -0.440282 -0.299050 -0.366722 -0.284910 -0.401151 -0.419496 -0.469510 ... -0.288231 -0.288231 -0.430822 -0.463243 -0.441564 -0.419496 -0.518069 -0.440282 -0.469510 -0.464533
214716935 -0.173355 -0.442302 -0.038002 -0.038002 -0.259541 -0.053066 -0.245974 -0.148393 -0.062892 -0.154969 ... -0.287357 -0.287357 -0.017233 -0.115849 -0.019306 -0.062892 -0.102994 -0.038002 -0.154969 -0.116428
214774687 -0.349229 -0.365367 -0.287775 -0.287775 -0.371978 -0.454964 -0.532969 -0.379856 -0.295802 -0.351526 ... -0.343609 -0.343609 -0.329447 -0.306853 -0.305620 -0.295802 -0.345713 -0.287775 -0.351526 -0.385350
214832672 -0.103046 -0.119319 -0.267813 -0.267813 -0.393802 -0.138544 0.021963 -0.292460 -0.302877 -0.304986 ... -0.345289 -0.345289 -0.486330 -0.348144 -0.407270 -0.302877 -0.317338 -0.267813 -0.304986 -0.358180
214836765 0.184219 -0.294348 -0.063637 -0.063637 -0.067768 0.217677 0.090786 0.175253 0.109586 0.100496 ... -0.219173 -0.219173 -0.138079 -0.134929 -0.087323 0.109586 -0.077201 -0.063637 0.100496 -0.248838
214706482 0.321659 -0.056601 -0.145869 -0.145869 0.080555 0.261597 -0.424388 -0.173043 -0.177623 -0.224756 ... -0.010602 -0.010602 -0.241670 -0.238570 -0.196229 -0.177623 -0.251743 -0.145869 -0.224756 -0.158949
214701242 0.324602 -0.034401 -0.128756 -0.128756 -0.037503 0.173491 -0.330859 0.001820 -0.036339 -0.117675 ... -0.011546 -0.011546 -0.192176 -0.223951 -0.159817 -0.036339 -0.252575 -0.128756 -0.117675 -0.217573
214826623 -0.525282 -0.427017 -0.337274 -0.337274 -0.446374 -0.457658 0.071204 -0.331847 -0.328527 -0.359193 ... -0.417747 -0.417747 -0.559224 -0.593267 -0.616087 -0.328527 -0.322811 -0.337274 -0.359193 -0.176417
214826835 -0.405615 -0.410813 -0.496546 -0.496546 -0.329861 -0.388222 -0.443270 -0.432017 -0.415632 -0.461856 ... -0.345735 -0.345735 -0.532703 -0.569730 -0.566815 -0.415632 -0.494202 -0.496546 -0.461856 -0.428607
214826715 -0.417705 -0.421942 -0.558647 -0.558647 -0.330552 -0.411174 -0.524730 -0.473508 -0.466597 -0.500524 ... -0.353741 -0.353741 -0.590674 -0.614723 -0.616763 -0.466597 -0.543737 -0.558647 -0.500524 -0.494951
214838855 -0.247946 0.056518 -0.277044 -0.277044 -0.104675 -0.309068 -0.348758 -0.318039 -0.325676 -0.260048 ... -0.036010 -0.036010 0.061677 0.110949 -0.066546 -0.325676 -0.186910 -0.277044 -0.260048 -0.285663
214576500 -0.472854 -0.522786 -0.413322 -0.413322 -0.570292 -0.427343 -0.219024 -0.261439 -0.293936 -0.376999 ... -0.588814 -0.588814 -0.485795 -0.443675 -0.383495 -0.293936 -0.480899 -0.413322 -0.376999 -0.428747
214821275 -0.198612 -0.346111 -0.470426 -0.470426 -0.371045 -0.312556 0.260633 -0.224205 -0.285291 -0.360532 ... -0.382514 -0.382514 -0.364729 -0.321556 -0.288133 -0.285291 -0.528044 -0.470426 -0.360532 -0.483965
214821371 0.127821 0.028603 -0.199935 -0.199935 -0.116037 -0.043043 -0.115451 -0.175487 -0.187561 -0.242401 ... -0.108702 -0.108702 -0.371950 -0.366135 -0.229023 -0.187561 -0.342489 -0.199935 -0.242401 -0.243297
214717089 -0.537638 -0.479905 -0.379751 -0.379751 -0.548305 -0.507286 -0.354199 -0.369188 -0.363175 -0.289343 ... -0.516003 -0.516003 0.096101 0.116100 0.410754 -0.363175 -0.379715 -0.379751 -0.289343 -0.358218
214563337 -0.496042 -0.385756 -0.121328 -0.121328 -0.377076 -0.464272 -0.056642 -0.122752 -0.151598 -0.086935 ... -0.376932 -0.376932 -0.217038 -0.301896 -0.419867 -0.151598 -0.033050 -0.121328 -0.086935 -0.042518
214706462 -0.452655 -0.433686 -0.256456 -0.256456 -0.333944 -0.451425 -0.333313 -0.205992 -0.245826 -0.163398 ... -0.343435 -0.343435 -0.270971 -0.350085 -0.407322 -0.245826 -0.190628 -0.256456 -0.163398 -0.237257
214717436 -0.439677 -0.372182 -0.161998 -0.161998 -0.365666 -0.353474 -0.067151 -0.188908 -0.190118 -0.136385 ... -0.358661 -0.358661 -0.243013 -0.312035 -0.433808 -0.190118 -0.071552 -0.161998 -0.136385 -0.154096
214743335 -0.443706 -0.360044 -0.291202 -0.291202 -0.373776 -0.416725 -0.081424 -0.235714 -0.215067 -0.224139 ... -0.389469 -0.389469 -0.288914 -0.377296 -0.420365 -0.215067 -0.282474 -0.291202 -0.224139 -0.301462
214826837 -0.305364 -0.449062 -0.443744 -0.443744 -0.332658 -0.345078 0.130431 -0.483800 -0.435925 -0.437740 ... -0.338791 -0.338791 -0.614157 -0.630368 -0.677457 -0.435925 -0.502522 -0.443744 -0.437740 -0.405414
214819762 -0.387035 -0.322836 -0.215700 -0.215700 -0.248009 -0.408297 -0.297279 -0.138703 -0.180692 -0.127779 ... -0.271403 -0.271403 -0.244806 -0.306741 -0.353461 -0.180692 -0.206872 -0.215700 -0.127779 -0.178724
214717867 -0.367961 -0.402355 -0.205072 -0.205072 -0.257731 -0.332211 0.224112 -0.216416 -0.159243 -0.258154 ... -0.281864 -0.281864 -0.072552 -0.117154 -0.095956 -0.159243 -0.272647 -0.205072 -0.258154 -0.351179
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214864663 0.131976 0.156177 -0.098190 -0.098190 0.081875 0.072065 -0.242473 -0.112268 -0.085633 -0.037557 ... 0.132334 0.132334 0.274233 0.258617 0.348002 -0.085633 -0.046966 -0.098190 -0.037557 -0.212307
214865297 -0.002189 -0.030229 -0.056327 -0.056327 -0.021826 -0.016757 -0.184023 -0.071819 -0.052943 -0.025930 ... 0.000167 0.000167 0.202035 0.220754 0.213960 -0.052943 -0.042275 -0.056327 -0.025930 -0.056303
214865295 0.381456 0.273147 0.110504 0.110504 0.165695 0.303707 -0.183743 0.059009 0.111918 0.117646 ... 0.186111 0.186111 0.370438 0.381182 0.401839 0.111918 0.106883 0.110504 0.117646 0.060742
214865653 0.208109 -0.038793 -0.038890 -0.038890 -0.003990 0.027234 -0.152063 -0.043282 -0.026087 -0.007182 ... -0.033928 -0.033928 0.049240 0.045151 0.079083 -0.026087 -0.028106 -0.038890 -0.007182 -0.066585
214865589 -0.119468 -0.076225 -0.310680 -0.310680 -0.101267 -0.137464 -0.303920 -0.363541 -0.359667 -0.325946 ... -0.078079 -0.078079 0.006807 0.053552 0.032830 -0.359667 -0.275091 -0.310680 -0.325946 -0.261073
214863285 0.067932 0.071766 0.092283 0.092283 0.013596 0.027671 -0.123431 0.049608 0.095081 0.089798 ... 0.020485 0.020485 0.336194 0.325334 0.360640 0.095081 0.108131 0.092283 0.089798 0.067193
214865587 -0.098827 -0.021514 -0.152414 -0.152414 -0.089473 -0.062692 -0.205960 -0.159017 -0.162817 -0.139829 ... -0.079009 -0.079009 0.075636 0.102667 0.085907 -0.162817 -0.123321 -0.152414 -0.139829 -0.125289
214864931 0.030086 0.138159 -0.000106 -0.000106 0.008056 -0.009064 -0.176709 -0.021386 0.004007 0.030231 ... 0.065162 0.065162 0.269706 0.260429 0.300058 0.004007 0.030204 -0.000106 0.030231 -0.029869
214865666 -0.039352 -0.033140 -0.172775 -0.172775 -0.017734 0.033995 -0.211593 -0.157555 -0.176124 -0.144093 ... -0.009440 -0.009440 -0.092107 -0.084231 -0.114151 -0.176124 -0.162920 -0.172775 -0.144093 -0.189945
214864946 0.268373 0.185872 -0.031505 -0.031505 0.154984 0.206921 -0.139802 -0.033600 -0.009959 0.035568 ... 0.172960 0.172960 0.247328 0.233907 0.320057 -0.009959 -0.008260 -0.031505 0.035568 -0.098514
214864925 -0.031723 0.055513 0.136991 0.136991 0.000814 -0.034658 -0.214832 0.130817 0.184260 0.174636 ... 0.050968 0.050968 0.475170 0.426648 0.496146 0.184260 0.162904 0.136991 0.174636 0.089835
214536251 0.282021 0.543217 0.116823 0.116823 0.206876 0.110867 -0.432958 0.084336 0.133187 0.195031 ... 0.275928 0.275928 0.540170 0.533511 0.499529 0.133187 0.199792 0.116823 0.195031 0.150634
214865679 0.313743 -0.018918 -0.066355 -0.066355 0.003284 0.044060 -0.217530 -0.079806 -0.046218 -0.023673 ... -0.033376 -0.033376 0.032888 0.039080 0.071872 -0.046218 -0.061435 -0.066355 -0.023673 -0.133435
214864430 -0.165868 -0.145093 -0.125760 -0.125760 -0.153875 -0.176402 -0.110209 -0.113888 -0.106102 -0.129742 ... -0.163898 -0.163898 -0.120980 -0.118082 -0.087570 -0.106102 -0.141139 -0.125760 -0.129742 -0.106438
214865462 0.091305 0.076658 -0.166748 -0.166748 -0.119104 -0.024135 -0.457162 -0.202025 -0.162572 -0.160049 ... -0.121747 -0.121747 0.091425 0.071060 0.087228 -0.162572 -0.159160 -0.166748 -0.160049 -0.164808
214862145 0.249451 0.000513 0.101365 0.101365 0.047055 0.068289 -0.206045 0.076308 0.097977 0.115994 ... 0.026314 0.026314 0.177946 0.171988 0.212927 0.097977 0.103769 0.101365 0.115994 0.064340
214862126 0.317351 0.010861 -0.118878 -0.118878 0.032531 0.079067 -0.295322 -0.126852 -0.096168 -0.076449 ... -0.000697 -0.000697 -0.020770 -0.022258 -0.000526 -0.096168 -0.107633 -0.118878 -0.076449 -0.172637
214863892 0.191479 0.312786 -0.080403 -0.080403 0.053734 0.076571 -0.176140 -0.111973 -0.044336 -0.055478 ... 0.072952 0.072952 0.393428 0.360947 0.422343 -0.044336 -0.089054 -0.080403 -0.055478 -0.116369
214863403 0.188748 0.201977 -0.062941 -0.062941 0.122239 0.114996 -0.131010 -0.103768 -0.044610 -0.048483 ... 0.135248 0.135248 0.268746 0.243238 0.296588 -0.044610 -0.069661 -0.062941 -0.048483 -0.081320
214862096 0.299521 -0.038340 -0.108409 -0.108409 0.009277 0.087434 -0.255374 -0.129189 -0.093714 -0.071984 ... -0.023549 -0.023549 -0.006882 0.010888 0.060425 -0.093714 -0.109992 -0.108409 -0.071984 -0.149062
214863084 0.240733 0.207181 0.106514 0.106514 0.128144 0.146688 -0.139639 0.067105 0.107850 0.132569 ... 0.142351 0.142351 0.334613 0.318397 0.330106 0.107850 0.121022 0.106514 0.132569 0.100737
214864944 0.105427 0.065958 0.007897 0.007897 0.064400 0.091358 -0.144831 -0.004420 0.013886 0.036019 ... 0.089857 0.089857 0.213278 0.198426 0.279629 0.013886 0.013903 0.007897 0.036019 -0.021042
214864942 0.076817 0.013442 -0.017801 -0.017801 0.018845 0.082511 -0.174243 -0.001086 0.002426 0.046211 ... 0.022780 0.022780 0.287096 0.284979 0.352704 0.002426 0.013178 -0.017801 0.046211 -0.027524
214855222 0.273672 0.235643 0.097567 0.097567 0.135193 0.159462 -0.001878 0.038948 0.067110 0.079303 ... 0.160761 0.160761 0.221940 0.218547 0.281422 0.067110 0.057758 0.097567 0.079303 0.026073
214856439 0.041885 0.082766 -0.054626 -0.054626 0.026354 -0.020509 -0.067841 -0.087759 -0.058003 -0.060087 ... 0.035880 0.035880 0.171769 0.139797 0.134118 -0.058003 -0.043785 -0.054626 -0.060087 -0.045068
214865655 0.470613 0.214345 -0.098470 -0.098470 0.225787 0.302351 -0.105729 -0.102093 -0.079021 -0.070420 ... 0.185142 0.185142 -0.024245 -0.033333 -0.003679 -0.079021 -0.115965 -0.098470 -0.070420 -0.191655
214862115 0.248375 -0.026887 -0.128429 -0.128429 -0.004748 0.067446 -0.152741 -0.100941 -0.086352 -0.080972 ... -0.042070 -0.042070 -0.029169 -0.014532 0.030945 -0.086352 -0.132767 -0.128429 -0.080972 -0.172283
214864948 0.016356 0.106727 0.053041 0.053041 0.012441 -0.050290 -0.156517 0.023436 0.064003 0.081732 ... 0.059975 0.059975 0.393784 0.350926 0.424669 0.064003 0.067082 0.053041 0.081732 0.004623
214856739 -0.048415 0.023714 -0.065138 -0.065138 -0.025195 -0.093067 -0.170084 -0.074495 -0.058493 -0.029203 ... 0.032659 0.032659 0.139566 0.139346 0.196924 -0.058493 -0.042972 -0.065138 -0.029203 -0.095539
214863240 0.256465 0.247890 0.133251 0.133251 0.122981 0.135430 -0.105583 0.102983 0.140856 0.141890 ... 0.128277 0.128277 0.276673 0.260213 0.328843 0.140856 0.122890 0.133251 0.141890 0.086779

37483 rows × 500 columns


In [31]:
# Turn each event's scores into ranks (1 = highest-scoring item)
for c in df_preds:
    df_preds[c] = df_preds[c].rank(ascending=False)

In [32]:
df_valid_preds = df_valid.join(df_preds.transpose())
df_valid_preds = df_valid_preds.query("SessionId == next_SessionId").dropna()
df_valid_preds["next_ItemId"] = df_valid_preds["next_ItemId"].astype(int)
df_valid_preds["next_SessionId"] = df_valid_preds["next_SessionId"].astype(int)
# Look up the rank the model assigned to the actual next item of each event
df_valid_preds["next_ItemId_at"] = df_valid_preds.apply(lambda x: x[int(x["next_ItemId"])], axis=1)
df_valid_preds_summary = df_valid_preds[["SessionId","ItemId","Time","next_ItemId","next_ItemId_at"]]
df_valid_preds_summary.head(20)


Out[32]:
SessionId ItemId Time next_ItemId next_ItemId_at
3742 11255568 214696432 1.411987e+09 214857030 468.0
3744 11255571 214858854 1.411962e+09 214858854 1.0
3746 11255572 214836819 1.411999e+09 214696434 36.0
3748 11255599 214857570 1.411962e+09 214858847 9752.0
3749 11255599 214858847 1.411963e+09 214859094 5.0
3750 11255599 214859094 1.411963e+09 214690730 3.0
3751 11255599 214690730 1.411963e+09 214859872 15.0
3753 11255661 214854705 1.412007e+09 214854705 1.0
3710 11255731 214857030 1.412027e+09 214696432 622.0
3711 11255731 214696432 1.412027e+09 214696432 1.0
3712 11255731 214696432 1.412027e+09 214587952 6.0
3713 11255731 214587952 1.412027e+09 214696432 63.0
3714 11255731 214696432 1.412027e+09 214857039 491.0
3721 11255754 214858847 1.411976e+09 214859872 9.0
3722 11255754 214859872 1.411976e+09 214858847 19.0
3723 11255754 214858847 1.411976e+09 214859094 5.0
3724 11255754 214859094 1.411976e+09 214858687 13.0
3725 11255754 214858687 1.411976e+09 214859122 4.0
3726 11255754 214859122 1.411976e+09 214859092 2.0
3727 11255754 214859092 1.411976e+09 214859122 3.0

In [33]:
cutoff = 20
df_valid_preds_summary_ok = df_valid_preds_summary.query("next_ItemId_at <= @cutoff")
df_valid_preds_summary_ok.head(20)


Out[33]:
SessionId ItemId Time next_ItemId next_ItemId_at
3744 11255571 214858854 1.411962e+09 214858854 1.0
3749 11255599 214858847 1.411963e+09 214859094 5.0
3750 11255599 214859094 1.411963e+09 214690730 3.0
3751 11255599 214690730 1.411963e+09 214859872 15.0
3753 11255661 214854705 1.412007e+09 214854705 1.0
3711 11255731 214696432 1.412027e+09 214696432 1.0
3712 11255731 214696432 1.412027e+09 214587952 6.0
3721 11255754 214858847 1.411976e+09 214859872 9.0
3722 11255754 214859872 1.411976e+09 214858847 19.0
3723 11255754 214858847 1.411976e+09 214859094 5.0
3724 11255754 214859094 1.411976e+09 214858687 13.0
3725 11255754 214858687 1.411976e+09 214859122 4.0
3726 11255754 214859122 1.411976e+09 214859092 2.0
3727 11255754 214859092 1.411976e+09 214859122 3.0
3728 11255754 214859122 1.411976e+09 214854962 8.0
3716 11255758 214820345 1.411999e+09 214854540 4.0
3719 11255758 214848658 1.411999e+09 214848658 1.0
3730 11255771 214859126 1.411993e+09 214859300 8.0
3739 11255787 214857562 1.411994e+09 214857570 3.0
3740 11255787 214857570 1.411994e+09 214857257 12.0

In [34]:
recall_at_k = df_valid_preds_summary_ok.shape[0] / df_valid_preds_summary.shape[0]
print("The recall@%i for this batch is %f"%(cutoff,recall_at_k))


The recall@20 for this batch is 0.559585
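The recall@k computed above is just the fraction of cases where the desired next item's rank is at or below the cutoff. On a toy series of ranks (values made up for illustration):

```python
import pandas as pd

ranks = pd.Series([1, 5, 3, 468, 9752, 15, 2])  # rank assigned to each desired next item
cutoff = 20
recall_at_k = (ranks <= cutoff).mean()  # 5 of the 7 ranks are within the top 20
print(recall_at_k)
```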

In [35]:
fig = plt.figure(figsize=[15,8])
ax = fig.add_subplot(111)
ax = sns.kdeplot(df_valid_preds_summary["next_ItemId_at"], ax=ax)
ax.set(xlabel='Next Desired Item @K', ylabel='Kernel Density Estimation')
plt.show()
fig = plt.figure(figsize=[15,8])
ax = fig.add_subplot(111)
ax = sns.distplot(df_valid_preds_summary["next_ItemId_at"],
             hist_kws=dict(cumulative=True),
             kde_kws=dict(cumulative=True))
ax.set(xlabel='Next Desired Item @K', ylabel='Cumulative Probability')
plt.show()



In [36]:
print("Statistics for the rank of the next desired item (lower is better)")
df_valid_preds_summary[["next_ItemId_at"]].describe()


Statistics for the rank of the next desired item (lower is better)
Out[36]:
next_ItemId_at
count 386.000000
mean 3347.033679
std 8905.124331
min 1.000000
25% 4.000000
50% 15.500000
75% 205.750000
max 37454.000000